{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Linear Regression \n",
    "\n",
    "## Introduction \n",
    "\n",
    "* What is regression, and what is linear regression\n",
    "* Methods of estimation\n",
    "\n",
    "<a href = \"https://www.kdnuggets.com/2016/11/machine-learning-vs-statistics.html\"> Aatash Shah once offered the following definitions in his article:</a>\n",
    "\n",
    "> * \"Machine learning\" refers to algorithms that can learn directly from data, without relying on rules-based programming.\n",
    "\n",
    "> * \"Statistical modeling\" means representing the relationships between data variables with mathematical equations.\n",
    "\n",
    "In practice, many of the advanced algorithms in machine learning build on linear regression, so before we can go further into other algorithms we need a solid understanding of what linear regression is.\n",
    "\n",
    "### What is regression \n",
    "\n",
    "In the real world there are many situations where two or more variables are related, but not by an exact relationship (not exact enough for one to strictly determine the other). For example, a person's height $X$ and weight $Y$ are related: as $X$ increases, $Y$ tends to increase as well, but $X$ does not strictly determine $Y$. Likewise, the per-acre yield $Y$ of a crop is related to its seeding rate $X_1$, fertilizer amount $X_2$, and so on, but $X_1, X_2, \\dots$ do not strictly determine $Y$.\n",
    "\n",
    "Following these examples, $Y$ is usually called the **dependent variable** or regressand, while the $X$'s are called the regressors or independent variables.\n",
    "\n",
    "Why can't $X$ strictly determine $Y$? Simply because there are too many relevant variables, and in most cases we cannot enumerate them all; moreover, even if we could include them all, the model would easily become overfitted. Note also that the $X$'s are variables *associated* with $Y$, not necessarily its causes; the English terms make the distinction clear: correlation vs. causality. \n",
    "\n",
    "Now suppose a problem has a dependent variable $Y$ and independent variables $X_1, X_2,\\dots, X_p$. We can think of $Y$ as consisting of two parts: one part driven by $X_1, X_2,\\dots, X_p$, and one part driven by unknown factors, which we attribute to \"error\". This gives us the equation:\n",
    "$$Y = f(X) + \\epsilon \\ \\ \\ \\ X = (X_1, X_2,\\dots, X_p), \\ \\ \\epsilon = \\text{error}$$\n",
    "As a random error, we require its expectation to be zero: $E(\\epsilon) = 0$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Linear Regression\n",
    "\n",
    "Here we only discuss the case where the regression function $f(X)$ is linear (including cases that can be transformed into a linear function); this is called linear regression. In this section we restrict ourselves to simple linear regression, i.e. the case with a single independent variable $X$ (there is, as always, exactly one dependent variable $Y$).\n",
    "\n",
    "$\\begin{align*}\n",
    "    Y &= \\beta_0 + \\beta_1 X + \\epsilon \\\\ \n",
    "    E(\\epsilon) &= 0, \\ \\ \\ Var(\\epsilon) = \\sigma^2 \\\\\n",
    " \\end{align*}$\n",
    "where:\n",
    "\n",
    "* $X$: Independent variable (regressor);\n",
    "* $Y$: Dependent variable (response);\n",
    "* $\\epsilon$: Random error;\n",
    "* $\\beta_0, \\beta_1, \\sigma^2$: Unknown parameters;\n",
    "* $\\sigma^2$: Random error variance \n",
    "\n",
    "#### Methods of Estimation: Point Estimation \n",
    "\n",
    "We make $n$ independent observations of the variables $X$ and $Y$, obtaining the sample:\n",
    "$$(X_1, Y_1), (X_2, Y_2), \\dots, (X_n, Y_n)$$ \n",
    "\n",
    "For this sample: $y_i = \\beta_0 + \\beta_1 x_i + \\epsilon_i, \\ \\ \\ i = 1,\\dots,n$.\n",
    "Substituting the sample into the model gives $n$ equations, each containing a random error term.\n",
    "\n",
    "We need to estimate $\\beta_0$ and $\\beta_1$.\n",
    "\n",
    "##### Least Squares Estimation \n",
    "\n",
    "* Observations: $(X_1, Y_1), (X_2, Y_2), \\dots, (X_n, Y_n)$\n",
    "* Model: $y_i = \\beta_0 + \\beta_1 x_i + \\epsilon_i , \\ \\ \\ i = 1,\\dots, n$\n",
    "* $\\epsilon_1, \\dots, \\epsilon_n$ are i.i.d., $\\ E(\\epsilon_i) = 0, \\ \\ Var(\\epsilon_i) = \\sigma^2$ \n",
    "\n",
    "We want to estimate $\\beta_0$, $\\beta_1$ and $\\sigma^2$; we can use the method of **least squares**.\n",
    "\n",
    "* Residuals: $e_i = y_i - \\hat{y_i}$;\n",
    "* Fitted value of $Y$: $\\hat{y_i} = \\hat{\\beta_0} + \\hat{\\beta_1}x_i$;\n",
    "* Estimates of $\\beta_0$ and $\\beta_1$: $\\hat{\\beta_0}$ and $\\hat{\\beta_1}$.\n",
    "\n",
    "The goal of least squares is to minimize the sum of squared residuals:\n",
    "\n",
    "$$\\displaystyle Q = \\sum_{i=1}^n e_i^2 = \\sum_{i=1}^n(y_i - \\beta_0 - \\beta_1 x_i)^2$$\n",
    "\n",
    "To minimize $Q$ with respect to $\\beta_0$ and $\\beta_1$, set the partial derivatives to zero:\n",
    "\n",
    "$\\begin{align*}\n",
    "    &\\frac{\\partial Q}{\\partial \\beta_0} = -2\\sum_{i=1}^n (y_i - \\beta_0 - \\beta_1x_i) = 0 \\\\\n",
    "    &\\frac{\\partial Q}{\\partial \\beta_1} = -2\\sum_{i=1}^n (y_i -\\beta_0 -\\beta_1 x_i)x_i = 0 \\\\\n",
    "    &S_{XX} = \\sum_{i=1}^n (x_i - \\bar{x})^2 = \\sum_{i=1}^n x_i^2 - \\frac{1}{n}(\\sum_{i=1}^n x_i)^2 \\\\\n",
    "    &S_{XY} = \\sum_{i=1}^n (x_i - \\bar{x})(y_i - \\bar{y}) = \\sum_{i=1}^n x_i y_i - \\frac{1}{n}(\\sum_{i=1}^n x_i)(\\sum_{i=1}^n y_i) \\\\\n",
    "  \\end{align*}$\n",
    "$\\Rightarrow \n",
    "    \\begin{equation}\n",
    "        \\left\\{\n",
    "             \\begin{array}{lr}\n",
    "                 \\hat{\\beta_1} = S_{XY} / S_{XX} \\\\\n",
    "                 \\hat{\\beta_0} = \\bar{y} - \\hat{\\beta_1} \\bar{x} \\\\\n",
    "              \\end{array}\n",
    "        \\right.\n",
    "    \\end{equation}\n",
    "$"
   ]
  },
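  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick numerical check of the closed-form solution above, we can compute $S_{XX}$, $S_{XY}$, $\\hat{\\beta_1}$ and $\\hat{\\beta_0}$ directly with `numpy`. This is a minimal sketch: the arrays `x` and `y` are made-up illustrative values, not the course dataset.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# hypothetical sample of n = 5 observations\n",
    "x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])\n",
    "y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])\n",
    "\n",
    "S_xx = np.sum((x - x.mean()) ** 2)              # sum of squared deviations of x\n",
    "S_xy = np.sum((x - x.mean()) * (y - y.mean()))  # cross-deviation sum\n",
    "\n",
    "b1 = S_xy / S_xx               # slope estimate: beta_1_hat = S_XY / S_XX\n",
    "b0 = y.mean() - b1 * x.mean()  # intercept estimate: beta_0_hat\n",
    "print(b1, b0)  # b1 = 1.96, b0 = 0.14 (up to floating point)\n",
    "```"
   ]
  },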
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section (using Python for linear regression), we will use `numpy`, `scipy.stats`, and `matplotlib.pyplot` to do the regression analysis. You can also try `scikit-learn` by yourself. If you want to know more about `scikit-learn`, **please skim through its official website:** <a href=\"https://scikit-learn.org/stable/index.html\"> <strong>scikit-learn</strong> </a>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Hours</th>\n",
       "      <th>Scores</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>25.000000</td>\n",
       "      <td>25.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>5.012000</td>\n",
       "      <td>51.480000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>2.525094</td>\n",
       "      <td>25.286887</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.100000</td>\n",
       "      <td>17.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>2.700000</td>\n",
       "      <td>30.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>4.800000</td>\n",
       "      <td>47.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>7.400000</td>\n",
       "      <td>75.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>9.200000</td>\n",
       "      <td>95.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           Hours     Scores\n",
       "count  25.000000  25.000000\n",
       "mean    5.012000  51.480000\n",
       "std     2.525094  25.286887\n",
       "min     1.100000  17.000000\n",
       "25%     2.700000  30.000000\n",
       "50%     4.800000  47.000000\n",
       "75%     7.400000  75.000000\n",
       "max     9.200000  95.000000"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from scipy import stats\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "dataset = pd.read_csv(\"https://raw.githubusercontent.com/TerenceLiu98/Using-R-Series/master/data_set/student_scores.csv\")\n",
    "dataset.shape # (rows, columns) of the dataset\n",
    "dataset.head() # first five rows of the dataset\n",
    "dataset.describe() # summary statistics; only this last expression is displayed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYIAAAEWCAYAAABrDZDcAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xu8XfOd//HXWxJyEjQiobmQpCOCCqInKlImkxCjlIxph7ba1GgxPJSaZkRvOuZn8NDfqJmqNsUwU3d159dGBaV1O0lUEKTUJSepHCqSECV8fn+stdmJc1nn5Ky99uX9fDz2Y6+19rp8zhb7s9b3u9bnq4jAzMwa1yZFB2BmZsVyIjAza3BOBGZmDc6JwMyswTkRmJk1OCcCM7MG50RgZtbgnAis10h6XtL+Gyz7iqT7i4qpN6V/y7uS1khaJelRSYcUHVe5evq+rXKcCKwmSepb0KEfiIjNgUHAJcC1kgZ3ZwcFxm7WLicCqyhJO0u6R9JKSU9IOrTss3skfbVsfr2zW0kh6URJS4AlSpwvaYWk1yU9JmnXdo55pKSWDZZ9Q9It6fSnJT0pabWkVknf7OrviIj3gEuBJuBj6X4OSa8SVkr6naTdyo73vKTTJD0GvCGpr6TtJN0gqU3Sq5J+VLb+P0paLOk1Sb+SNGqD7+F4SUvSzy9Mv4udgZ8Ak9KrlpXp+gdLWphexbwk6fsbfBdflvRCGsN3y6/sJG0iabakZ9PPu534rPo5EVjFSOoH3ArMBbYBTgKukDSuG7uZAXwS2AWYDuwH7Ehyhn4E8Go729wCjJM0tmzZF4Ar0+lLgOMiYgtgV2Behr+lL/BVYA1JUtqTJDEcB2wN/BS4RdJmZZt9Hjg4jTWA24AXgNHACODqdN8zgG8BhwNDgfuAqzYI4RBgIrA78A/AgRGxGDie9KolIgal674BfDk97sHAP6XHQNIuwI+BLwLDgI+ksZR8neQ7/2tgOPAacGFX34/VmIjwy69eeQHPk/wwrix7vQncn36+L/AnYJOyba4Cvp9O3wN8teyzr5S2TecDmFo2PxV4Bti7fJ8dxPZz4Hvp9FhgNTAgnX+R5Ad8yy728RVgXfp3vQI8COyffnYR8G8brP808Ndl380/ln02CWgD+rZznP8HHFM2v0n6PY4q+x4+Vfb5tcDs9r6zDv6OHwLnp9PfA64q+2wA8HbZ37UYmFb2+TDgnfbi9qt2X74isN42IyIGlV7ACWWfDQdeiqRZpeQF1j8D7cpLpYmImAf8iOQM9WVJcyRt2cF2V5KckUNyNXBTRLyZzv898GngBUn3SprUyfEfTP+2IRGxd0T8Ol0+CvjntFloZdossx3J3/yh2NPPXoiIde0cYxRwQdl+/gyI9b+nP5VNvwls3lHAkj4p6e60Cep1kquGIenHw1n/O32T9a+qRgE3lsWyGHgX2Laj41ntcSKwSloGbCep/N/d9kBrOv0GyRlpyUfb2cd65XIj4j8j4hPAx0maiGZ1cOy5wBBJe5AkhFKzEBHxSEQcRtJcdRPJGXZ3vQScVZ4EI2JARJQ36cQG62/fQcfxSyRNVeX7aoqI32WIo71ywleSNI9tFxEfIelHUPrZcmBkaUVJTSRNW+WxHLRBLP0johWrG04EVkkPkfzY/4ukfpKmAJ8hbRsHHgUOlzRA0g7AMZ3tTNLE9Gy3X7rft0jOVj8kPfO+HjgPGAzcme5jU0lflPSRiHgHWNXRPrrwM+D4NB5JGph20m7RwfoPk/wIn5Ou21/S5PSznwCnS/p4GuNHJH0uYxwvAyMlbVq2bAvgzxHxlqS9SK6ISq4HPiNpn3Sbf+WDJFGK5axSZ7WkoZIOyxiL1QgnAquYiHgbOBQ4iKSN/cfAlyPiqXSV80nap18GLgeu6GKXW5L8AL9G0sT0KvCDTta/EtgfuG6DJpkvAc9LWkXSbHJUN/4sACKiBfgaSVPVa8AfSNrrO1r/XZIkuANJH8VSks5uIuJG4Fzg6jSmx0m+syzmAU8Af5L0SrrsBOBMSatJ+gTev+KJiCdIOu2vJklMq4EV
wF/SVS4guZqYm27/IElnvdURRXhgGjNLSNqcpDN8bET8seh4rDJ8RWDW4CR9Jm2OG0hyRbWI5C4naxBOBGZ2GElH/jKSW2uPDDcVNBQ3DZmZNThfEZiZNbiaKH41ZMiQGD16dNFhmJnVlPnz578SEUO7Wq8mEsHo0aNpaWnpekUzM3ufpBeyrOemITOzBudEYGbW4JwIzMwaXE30EbTnnXfeYenSpbz11ltFh1IV+vfvz8iRI+nXr1/RoZhZjanZRLB06VK22GILRo8ejaSuN6hjEcGrr77K0qVLGTNmTNHhmFmNqdlE8NZbbzkJpCSx9dZb09bWVnQoZtaBmxa2ct6vnmbZyrUMH9TErAPHMWNCd4biyE/NJgLASaCMvwuz6nXTwlZOv2ERa99JKpy3rlzL6TcsAqiKZODOYjOznJ33q6ffTwIla995l/N+9XRBEa3PiWAjnHXWWXz84x9nt912Y4899uChhx4qOiQzq0LLVq7t1vJKq+mmoe7o7fa5Bx54gNtuu40FCxaw2Wab8corr/D222/3eH/r1q2jb9+G+c9h1lCGD2qitZ0f/eGDmgqI5sMa4oqg1D7XunItwQftczct7Pmwq8uXL2fIkCFsttlmAAwZMoThw4fzyCOPsM8++7D77ruz1157sXr1at566y2OPvpoxo8fz4QJE7j77rsBuOyyyzj00EOZOnUq06ZNA+C8885j4sSJ7LbbbpxxxhkAvPHGGxx88MHsvvvu7LrrrlxzzTUb94WYWUXNOnAcTf36rLesqV8fZh04rqCI1tcQp6Cdtc/19Kpg+vTpnHnmmey4447sv//+HHHEEUyaNIkjjjiCa665hokTJ7Jq1Sqampq44IILAFi0aBFPPfUU06dP55lnngFgwYIFPPbYYwwePJi5c+eyZMkSHn74YSKCQw89lN/85je0tbUxfPhwbr/9dgBef/31jfg2zKzSSr8zvmuoQHm0z22++ebMnz+f++67j7vvvpsjjjiCb3/72wwbNoyJEycCsOWWWwJw//33c9JJJwGw0047MWrUqPcTwQEHHMDgwYMBmDt3LnPnzmXChAkArFmzhiVLlrDvvvvyzW9+k9NOO41DDjmEfffdt8dxm1kxZkwYUTU//BtqiESQV/tcnz59mDJlClOmTGH8+PFceOGF7d7G2dngPwMHDlxvvdNPP53jjjvuQ+vNnz+fO+64g+985ztMmzaN733vexsVu5lZSUP0EeTRPvf000+zZMmS9+cfffRRdt55Z5YtW8YjjzwCwOrVq1m3bh377bcfV1xxBQDPPPMML774IuPGffjYBx54IJdeeilr1qwBoLW1lRUrVrBs2TIGDBjAUUcdxaxZs1iwYEGP4zYz21BDXBHk0T63Zs0aTjrpJFauXEnfvn3ZYYcdmDNnDkcffTQnnXQSa9eupampiV//+teccMIJHH/88YwfP56+ffty2WWXvd/JXG769OksXryYSZMmAUnz089//nP+8Ic/MGvWLDbZZBP69evHRRdd1OO4zcw2VBNjFjc3N8eGA9MsXryYnXfeuaCIqpO/EzMrJ2l+RDR3tV5DNA2ZmVnHck0Ekk6W9LikJySdki4bLOlOSUvS963yjMHMzDqXWyKQtCvwNWAvYHfgEEljgdnAXRExFrgrne+RWmjWqhR/F2bWU3leEewMPBgRb0bEOuBe4O+Aw4DL03UuB2b0ZOf9+/fn1Vdf9Q8gH4xH0L9//6JDMbMalOddQ48DZ0naGlgLfBpoAbaNiOUAEbFc0jbtbSzpWOBYgO233/5Dn48cOZKlS5e6Bn+qNEKZmVl35ZYIImKxpHOBO4E1wO+Bdd3Yfg4wB5K7hjb8vF+/fh6Ny8ysF+T6HEFEXAJcAiDp34GlwMuShqVXA8OAFXnGYGZWayo9mlnedw1tk75vDxwOXAXcAsxMV5kJ3JxnDGZmtSSPasldyfs5gl9IehK4FTgxIl4DzgEOkLQEOCCdNzMzihnNLO+moQ+V
yYyIV4FpeR7XzKxWFTGamZ8sNjOrIh1VRc5zNDMnAjOreTctbGXyOfMYM/t2Jp8zL9f29LwVMZpZQ1QfNbP6VepcLbWrlzpXgaodCKYzRYxm5kRgZjUtj6Foi1bp0cycCMys5pTfZ99RkZk8O1frjROBmdWUDZuCOpJn52q9cWexmdWU9pqCNpR352q98RWBmdWUzpp8BBXpXK03TgRmVlOGD2qitZ1kMGJQE7+dPbWAiGqfm4bMrKYUcZ99vfMVgZnVlCLus693TgRmVnMqfZ99vXPTkJlZg3MiMDNrcG4aMjMrU+nRwaqBE4GZWareCthllfdQld+Q9ISkxyVdJam/pDGSHpK0RNI1kjbNMwYzs6yKGB2sGuSWCCSNAL4ONEfErkAf4EjgXOD8iBgLvAYck1cMZmbdUcToYNUg787ivkCTpL7AAGA5MBW4Pv38cmBGzjGYmWVSxOhg1SC3RBARrcAPgBdJEsDrwHxgZUSsS1dbCrTb8CbpWEktklra2tryCtPM7H2N+tRynk1DWwGHAWOA4cBA4KB2Vm23nHhEzImI5ohoHjp0aF5hmpm9b8aEEZx9+HhGDGpCJPWLzj58fF13FEO+dw3tD/wxItoAJN0A7AMMktQ3vSoYCSzLMQYzs25pxKeW8+wjeBHYW9IASQKmAU8CdwOfTdeZCdycYwxmZtaFPPsIHiLpFF4ALEqPNQc4DThV0h+ArYFL8orBzMy6lusDZRFxBnDGBoufA/bK87hmZpadaw2ZmTU4l5gwsx5rxLo89ciJwMx6pFHr8tQjNw2ZWY80al2eeuQrAjPrkUasy1OvTWG+IjCzHmm0ujylprDWlWsJPmgKu2lha9GhbTQnAjPrkUary1PPTWFuGjKzHik1idRjU0l76rkpzInAzHqskeryDB/URGs7P/r10BTmpiEzswzquSnMVwRmZhnUc1OYE4GZWUb12hTmpiEzswaXKRFIGiVp/3S6SdIW+YZlZmaV0mUikPQ1knEFfpouGgnclGdQZmZWOVmuCE4EJgOrACJiCbBNnkGZmVnlZEkEf4mIt0szkvrSwYDz5SSNk/Ro2WuVpFMkDZZ0p6Ql6ftWG/MHmJnZxsmSCO6V9C2gSdIBwHXArV1tFBFPR8QeEbEH8AngTeBGYDZwV0SMBe5K583MrCBZEsFsoI1k3OHjgDuA73TzONOAZyPiBeAw4PJ0+eXAjG7uy8zMelGnzxFI6gP8T0R8EfjZRhznSOCqdHrbiFgOEBHLJbm/wczqtsRzLej0iiAi3gVGSdq0pwdItz2UpEmpO9sdK6lFUktbW1tPD29mNaCeSzzXgixNQ88Bv5X0XUmnll7dOMZBwIKIeDmdf1nSMID0fUV7G0XEnIhojojmoUOHduNwZlZr6rnEcy3IkgieBW5L192i7JXV5/mgWQjgFmBmOj0TuLkb+zKzOlTPJZ5rQZe1hiLiXwHSp4kjItZk3bmkAcABJJ3MJecA10o6BngR+Fy3IjazulPPJZ5rQZeJQNKuwP8Cg9P5V4AvR8QTXW0bEW8CW2+w7FWSu4jMrJfUekfrrAPHcfoNi9ZrHqqXEs+1IEv10TnAqRFxN4CkKSR3EO2TY1xmllGpo7X0I1rqaAVqJhnUc4nnWpAlEQwsJQGAiLhH0sAcYzKzbuiso7WWfkjrtcRzLciSCJ6T9F2S5iGAo4A/5heSmXWHO1ptY2W5a+gfgaHADelrCHB0nkGZWXYddai6o9Wy6jIRRMRrEfH1iNgzfZ0SEa9VIjgz61o9j6VrlZFlPII7JQ0qm99K0q/yDcvMspoxYQRnHz6eEYOaEDBiUBNnHz7e7e2WWZY+giERsbI0ExGvuT6QWXVxR6ttjCx9BO9J2r40I2kUGcYjMDOz2pDliuDbwP2S7k3n9wOOzS8kMzOrpCwlJn4paU9g73TRNyLilXzDMjOzSumwaUjSKEkfAUh/+N8gqRv05Y0pS21mZtWlsz6Ca4GBAJL2IBlP4EVg
d+DH+YdmZmaV0FnTUFNELEunjwIujYj/K2kT4NH8QzOzklovKmfVrbMrApVNTyUZaJ6IeC/XiMxsPR69y/LWWSKYJ+laSRcAWwHz4P1Rxd6uRHBm5tG7LH+dNQ2dAhwBDAM+FRHvpMs/SnJLqZlVgIvKWd46TAQREcDV7SxfmGtEZrYej95lecvyZHGPSRok6XpJT0laLGmSpMFp/aIl6ftWecZgVi1uWtjK5HPmMWb27Uw+Z17mNn4XlbO85ZoIgAuAX0bETiS3nS4GZgN3RcRYkg7o2TnHYFa4jenwdVE5y5uSFqAuVpKagO0jInPvlKQtgd8DH4uyg0h6GpgSEcvTjud7IqLTU5vm5uZoaWnJemizqjP5nHntNu+MGNTEb2dPLSAiawSS5kdEc1frZSlD/RmS5wZ+mc7vIemWDDF8DGgD/lvSQkkXp0NcbhsRywHS93YrmUo6VlKLpJa2trYMhzOrXu7wtWqWpWno+8BewEqAiHgUGJ1hu77AnsBFETGBpERF5magiJgTEc0R0Tx06NCsm5lVJY8iZtUsSyJYFxGv92DfS4GlEfFQOn89SWJ4OW0SKj2TsKIH+zarKe7wtWqWJRE8LukLQB9JYyX9F/C7rjaKiD8BL0kq/UufBjwJ3ALMTJfNBG7ufthmtcUdvlbNuuwsljSA5AGy6SRlJ34F/FtEvNXlzpNidRcDmwLPkQx6vwlJQbvtSYrYfS4i/tzZftxZbGbWfVk7izPdNVQ0JwIzs+7Lmgi6HJhG0q18eGjK14EW4KdZrgzMzKx6ZekjeA5YA/wsfa0CVgM7pvNmZlbDsoxZvE9ETCybv1XSIxExUdITeQVmZmaVkeWKYHNJ25dm0unN01mXozYzq3FZrgj+Gbhf0rMkdw2NAU5InxK+PM/gzMwsf10mgoi4Q9JYYCeSRPBUWQfxD/MMzszM8pfligBgLDAO6A/sJomI+J/8wjKrHI8HbI0uy+2jZwBTgF2AO4CDgPsBJwKreaXy0KWhIEvloQEnA2sYWTqLP0tSHuJPEXE0ybgCH8k1KrMK8XjAZtkSwdqIeA9Yl44xsALYLt+wzCrD5aHNsiWCFkmDSB4emw8sAB7INSqzCnF5aLMMiSAiToiIlRHxE+AAYGbaRGRW81we2izbCGV3laYj4vmIeKx8mVktc3los07uGpLUHxgADJG0FckzBABbAsMrEJtZRcyYMMI//NbQOrt99DjgFJIf/fl8kAhWARfmHJeZmVVIh4kgIi4ALpB0UkT8VwVjMjOzCspSYuK/JO1DMmB937LlXT5QJul5kpLV75KMfdwsaTBwTbq/54F/iIjXehC7mZn1giydxf8L/AD4FDAxfXU54k2Zv4mIPcpGyZkN3BURY4G70nkzMytIllpDzcAu0XtjWh5GUrICkuql9wCn9dK+zcysm7I8UPY48NEe7j+AuZLmSzo2XbZtRCwHSN+3aW9DScdKapHU0tbW1sPDm5lZV7JcEQwBnpT0MPCX0sKIODTDtpMjYpmkbYA7JT2VNbCImAPMgWTw+qzbmZlZ92RJBN/v6c4jYln6vkLSjcBewMuShkXEcknDSGoXmZlZQbKUmLiX5O6efun0IyT1hjolaaCkLUrTwHSSZqZbgJnpajOBm3sUuZmZ9Yos4xF8DTgWGAz8FTAC+AlJaerObAvcKKl0nCsj4peSHgGulXQM8CLwuZ6Hb2ZmGytL09CJJE06DwFExJK0zb9TEfEcydgFGy5/la6TiFlV8mhmVo+yJIK/RMTb6Zk9kvqS3A1k1lA8mpnVqyy3j94r6VtAk6QDgOuAW/MNy6z6eDQzq1dZEsFsoA1YRFKI7g7gO3kGZVaNPJqZ1assTUNNwKUR8TMASX3SZW/mGZhZtRk+qInWdn70PZqZ1bosVwR3kfzwlzQBv84nHLPq5dHMrF5luSLoHxFrSjMRsUbSgBxjMqtKpQ5h3zVk9SZLInhD0p4RsQBA0icAN4paQ/JoZlaPsiSC
k4HrJC1L54cBR+QXkpmZVVKniUDSJsCmwE7AOJLhKp+KiHcqEJuZmVVAp4kgIt6TdGFETCCpE2RmZnUm011Dkv5epUeLzcysrmRJBMeRPE38tqRVklZLWpVzXGZmViFZBq/fohKBWPVyoTWz+pZl8HpJOkrSd9P57STtlX9oVg1KhdZaV64l+KDQ2k0LW4sOzcx6SZamoR8Dk4AvpPNrgAtzi8iqSq0WWrtpYSuTz5nHmNm3M/mceU5cZp3I8hzBJyNiT0kLASLiNUmb5hyXVYlaLLTmctFm3ZPliuCdtNBcAEgaCryX9QCS+khaKOm2dH6MpIckLZF0jZNKdeuooFo1F1qr1asYs6JkSQT/CdwIbCPpLOB+4N+7cYyTgcVl8+cC50fEWOA14Jhu7MsqrBYLrdXiVYxZkbIMXn8F8C/A2cByYEZEXJdl55JGAgcDF6fzAqYC16erXA7M6H7YVikzJozg7MPHM2JQEwJGDGri7MPHV3UTSy1exZgVqcM+Akn9geOBHUgGpflpRKzr5v5/SJJESregbg2sLNvPUqB6f1EMqL1Ca7MOHLdeHwFU/1WMWZE6uyK4HGgmSQIHAT/ozo4lHQKsiIj55YvbWbXd8Y8lHSupRVJLW1tbdw5tDa4Wr2LMiqSI9sehl7QoIsan032BhyNiz8w7ls4GvgSsA/oDW5L0NRwIfDQi1kmaBHw/Ig7sbF/Nzc3R0tKS9dBmZgZImh8RzV2t19kVwfsVRnvQJEREnB4RIyNiNHAkMC8ivgjcDXw2XW0mcHN3921mZr2ns0Swe1pbaJWk1cBuvVRr6DTgVEl/IOkzuGQj9mVmZhupw87iiOjT0WfdFRH3APek088BLlFhZlYlsjxHYGZmdcyJwMyswTkRmJk1OCcCM7MGl6X6qFmv8AA3ZtXJicAqwqWhzaqXm4asIlwa2qx6ORFYRbg0tFn1ciKwinBpaLPq5URgFVGLA9yYNQp3FltFlDqEfdeQWfVxIrCKqbUBbswahZuGzMwanBOBmVmDcyIwM2twTgRmZg3OicDMrMHldteQpP7Ab4DN0uNcHxFnSBoDXA0MBhYAX4qIt/OKo550VrStqIJuLiRnVvvyvH30L8DUiFgjqR9wv6T/B5wKnB8RV0v6CXAMcFGOcdSFzoq2AYUUdHMhObP6kFvTUCTWpLP90lcAU4Hr0+WXAzPyiqGedFa0raiCbi4kZ1Yfcu0jkNRH0qPACuBO4FlgZUSsS1dZCrR76ijpWEktklra2tryDLMmdFa0raiCbi4kZ1Yfck0EEfFuROwBjAT2AnZub7UOtp0TEc0R0Tx06NA8w6wJnRVtK6qgmwvJmdWHitw1FBErgXuAvYFBkkp9EyOBZZWIodZ1VrStqIJuLiRnVh/yvGtoKPBORKyU1ATsD5wL3A18luTOoZnAzXnFUE+yFG2r9N07LiRnVh8U0W7LzMbvWNqNpDO4D8mVx7URcaakj/HB7aMLgaMi4i+d7au5uTlaWlpyidPMrF5Jmh8RzV2tl9sVQUQ8BkxoZ/lzJP0FVqX8bIBZY3EZaluPnw0wazwuMWHr8bMBZo3HicDW42cDzBqPE4Gtx88GmDUeJ4I6cdPCViafM48xs29n8jnzuGlha4/242cDzBqPO4vrQG928PrZALPG40TQy4q49bKzDt6eHNuDzJs1FieCXlTUrZfu4DWzjeE+gl5U1K2X7uA1s43hRNCLijozdwevmW0MJ4JeVNSZ+YwJIzj78PGMGNSEgBGDmjj78PFu5zezTNxH0ItmHThuvT4CqNyZuTt4zaynnAh6kW+9NLNa5ETQy3xmbma1xomghrg8tJnlwYmgRrg8tJnlJbe7hiRtJ+luSYslPSHp5HT5YEl3SlqSvm+VVww91Vt1e3qTy0ObWV7yvH10HfDPEbEzyaD1J0raBZgN3BURY4G70vmqUTrzbl25luCDM++ik4GfHjazvOSWCCJieUQsSKdXA4uBEcBhJGMZk77P
yCuGnqjWM28/PWxmeanIA2WSRpOMX/wQsG1ELIckWQDbdLDNsZJaJLW0tbVVIkyges+8/fSwmeUl90QgaXPgF8ApEbEq63YRMScimiOieejQofkFuIFqPfP208Nmlpdc7xqS1I8kCVwRETeki1+WNCwilksaBqzIM4buKvLp4K74GQUzy0Oedw0JuARYHBH/UfbRLcDMdHomcHNeMfSEz7zNrNEoIvLZsfQp4D5gEfBeuvhbJP0E1wLbAy8Cn4uIP3e2r+bm5mhpacklTjOzeiVpfkQ0d7Vebk1DEXE/oA4+npbXcUv8FK6ZWTZ1+WSxn8I1M8uuLscjqNZnAczMqlFdJoJqfRbAzKwa1WUiqNZnAczMqlFdJgI/hWtmll1ddhZ7pDAzs+zqMhGAn8I1M8uqLpuGzMwsOycCM7MG50RgZtbgnAjMzBqcE4GZWYPLrfpob5LUBryQcfUhwCs5htNT1RhXNcYEjqs7qjEmqM64qjEmyDeuURHR5cheNZEIukNSS5ayq5VWjXFVY0zguLqjGmOC6oyrGmOC6ojLTUNmZg3OicDMrMHVYyKYU3QAHajGuKoxJnBc3VGNMUF1xlWNMUEVxFV3fQRmZtY99XhFYGZm3eBEYGbW4OomEUi6VNIKSY8XHUuJpO0k3S1psaQnJJ1cdEwAkvpLeljS79O4/rXomEok9ZG0UNJtRcdSIul5SYskPSqppeh4SiQNknS9pKfSf2OTCo5nXPodlV6rJJ1SZEwlkr6R/lt/XNJVkvpXQUwnp/E8UfT3VDd9BJL2A9YA/xMRuxYdD4CkYcCwiFggaQtgPjAjIp4sOC4BAyNijaR+wP3AyRHxYJFxAUg6FWgGtoyIQ4qOB5JEADRHRFU9jCTpcuC+iLhY0qbAgIhYWXRckCR0oBX4ZERkfRg0r1hGkPwb3yUi1kq6FrgjIi4rMKZdgauBvYC3gV8C/xQRS4qIp26uCCLiN8Cfi46jXEQsj4gF6fRqYDFQ+CAJkViTzvZLX4WfEUgaCRwMXFx0LNVO0pbAfsAlABHxdrV8ytX1AAAFQ0lEQVQkgdQ04Nmik0CZvkCTpL7AAGBZwfHsDDwYEW9GxDrgXuDvigqmbhJBtZM0GpgAPFRsJIm0CeZRYAVwZ0RUQ1w/BP4FeK/oQDYQwFxJ8yUdW3QwqY8BbcB/p01pF0saWHRQZY4Erio6CICIaAV+ALwILAdej4i5xUbF48B+kraWNAD4NLBdUcE4EVSApM2BXwCnRMSqouMBiIh3I2IPYCSwV3qpWhhJhwArImJ+kXF0YHJE7AkcBJyYNkMWrS+wJ3BRREwA3gBmFxtSIm2mOhS4ruhYACRtBRwGjAGGAwMlHVVkTBGxGDgXuJOkWej3wLqi4nEiyFnaBv8L4IqIuKHoeDaUNifcA/xtwaFMBg5N2+OvBqZK+nmxISUiYln6vgK4kaRdt2hLgaVlV3LXkySGanAQsCAiXi46kNT+wB8joi0i3gFuAPYpOCYi4pKI2DMi9iNp1i6kfwCcCHKVdspeAiyOiP8oOp4SSUMlDUqnm0j+R3mqyJgi4vSIGBkRo0maFeZFRKFnbQCSBqYd/aRNL9NJLusLFRF/Al6SNC5dNA0o9CaEMp+nSpqFUi8Ce0sakP4/OY2kv65QkrZJ37cHDqfA76xuBq+XdBUwBRgiaSlwRkRcUmxUTAa+BCxK2+MBvhURdxQYE8Aw4PL0zo5NgGsjompu16wy2wI3Jr8f9AWujIhfFhvS+04CrkibYp4Dji44HtL27gOA44qOpSQiHpJ0PbCApPllIVVQ1gH4haStgXeAEyPitaICqZvbR83MrGfcNGRm1uCcCMzMGpwTgZlZg3MiMDNrcE4EZmYNzonAapKkNRvMf0XSjyp4/L0lPZRW2Vws6fvp8imSuv2wkqTLJH02nb5Y0i7d2HZKNVVrtdpTN88RmPUGSX0i4t0Mq14O/ENE/D59HqP0YNcUkiq4v+tpDBHx1Z5ua9YT
viKwuiNplKS7JD2Wvm+fLn//rDudX5O+T5F0n6RbgMXp08S3p+M1PC7piHYOsw1JAbNS3aYn08KCxwPfSK8U9u3kmJL0I0lPS/p1ur/SOvdIak6np0t6QNICSdeldauQ9LdKxiFYQPJUqlmPORFYrWpS2SAowJlln/2IZFyK3YArgP/MsL89ScZk2JGk7tKyiNg9HduivSeJzweelnSjpOMk9Y+I54GfAOdHxB4RcV8nx/s7kquIXYAv007tG0lDgO8A+6dF71qAU5UMqvIz4DPAJ4CPZvj7zDrkRGC1am36Y7tHWkX1e2WfTQKuTKf/F/hUhv09HBF/TKcXAftLOlfSvhHx+oYrR8SZJAPozAW+QPvJojP7AVelVxPLgHntrLM3SaL4bZrsZgKjgJ1IiqgtiaQ0QFUU57Pa5URgjaBUR2Ud6b/5tPjYpmXrvPH+yhHPkJxpLwL+j6TyJEPZes9GxEUkRcx2T+vGbKizY3ZV30UkY0WUEt4uEXFMxm3NMnMisHr0O5IKpgBfJBmmEOB5kh94SOrT92tvY0nDgTcj4ufAebRT3lnSwekPO8BY4F1gJbAa2KJs1Y6O+RvgyHSAoGHA37QTyoPAZEk7pMccIGlHkkqxYyT9Vbre59v7O8yy8l1DVo++DlwqaRbJKF6lqpw/A26W9HuSppw3Oth+PHCepPdIKkP+UzvrfAk4X9KbJGf9X4yIdyXdClwv6TCS6qAdHfNGYCpJ6egXgQc2PEBEtEn6CnCVpM3Sxd+JiGeUjJR2e3r8+1g/+Zh1i6uPmpk1ODcNmZk1OCcCM7MG50RgZtbgnAjMzBqcE4GZWYNzIjAza3BOBGZmDe7/A7iZlUC/jGk4AAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x6cc5c6d0>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "dataset.plot(x='Hours', y='Scores', style='o')  \n",
    "plt.title('Hours vs Percentage')  \n",
    "plt.xlabel('Hours Studied')  \n",
    "plt.ylabel('Percentage Score')  \n",
    "plt.show()  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "b_1, b_0, r_value, p_value, std_err = stats.linregress(dataset.Hours,dataset.Scores)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Linear regression using stats.linregress\n",
      "parameters: b_1=9.78 b_0=2.48 \n",
      "model parameters: r_value =0.98, p_value =0.00, std error= 0.453\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print('Linear regression using stats.linregress')\n",
    "print('parameters: b_1=%.2f b_0=%.2f \\nmodel parameters: r_value =%.2f, p_value =%.2f, std error= %.3f' % (b_1, b_0, r_value, p_value, std_err))\n",
    "print('\\n')"
   ]
  },
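  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`stats.linregress` returns the slope and intercept, but not the error-variance estimate $\\hat{\\sigma}^2 = SSE/(n-2)$ from the theory above. A minimal sketch of how it could be computed from the residuals; the `hours`/`scores` arrays here are made-up values standing in for `dataset.Hours` and `dataset.Scores`:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy import stats\n",
    "\n",
    "# hypothetical data standing in for dataset.Hours / dataset.Scores\n",
    "hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])\n",
    "scores = np.array([10.0, 21.0, 29.0, 42.0, 48.0])\n",
    "\n",
    "slope, intercept, r_value, p_value, std_err = stats.linregress(hours, scores)\n",
    "\n",
    "residuals = scores - (intercept + slope * hours)  # e_i = y_i - y_i_hat\n",
    "n = len(hours)\n",
    "sigma2_hat = (residuals ** 2).sum() / (n - 2)     # SSE / (n - 2)\n",
    "print(sigma2_hat)\n",
    "```\n",
    "\n",
    "Dividing by $n-2$ rather than $n$ accounts for the two estimated parameters and makes $\\hat{\\sigma}^2$ unbiased."
   ]
  },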
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAARIAAADTCAYAAABN9JyGAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAHWZJREFUeJzt3Xt4VNW5x/HvK0QDAbl7g2LAg4ACkhgqgqJoNVUpAmrV1kq8FKReKt4KPVVqSwsWDkWt0gdF1KoUiogKeEREDsVWJAlXuRgLAROQmwICCSbwnj9mcpnJTDK3nT2TeT/P4yOz2bNnhYQfa639rrVFVTHGmGic5HYDjDGJz4LEGBM1CxJjTNQsSIwxUbMgMcZEzYLEGBM1CxJjTNQsSIwxUbMgMcZErbHbDQhF27ZtNT093e1mGJN08vLy9qlqu7rOS4ggSU9PJzc31+1mGJN0RGR7KOfZ0MYYEzULEmNM1BJiaGOMCc381cVMen8LOw+UcFbLJjya3ZUhGe0d/9yEDZKysjKKioooLS11uykmRKmpqXTo0IGUlBS3m9IgzV9dzNh56ykpOw5A8YESxs5bD+B4mCRskBQVFdG8eXPS09MREbebY+qgquzfv5+ioiI6derkdnMapEnvb6kMkQolZceZ9P4Wx4MkYedISktLadOmjYVIghAR2rRpYz1IB+08UBLW8VhK2CABLEQSjH2/nHVWyyZhHY+lhA4SY0yVR7O70iSlkc+xJimNeDS7q+OfbUHSwDzxxBMsWbIk6ussW7aMQYMGAfDOO+8wceLEqK9pnDUkoz0ThvWkfcsmCNC+ZRMmDOtpd20Shaqiqpx0UuS5XF5eTuPG0X87fve730V9DX+DBw9m8ODBMb+uib0hGe1rD45hw+Ctt2DHDvje92L2uQ0jSB58ENasie01e/eGqVOD/nZhYSHZ2dlcdNFF5OXlsWjRIrZs2cK4ceM4duwY55xzDjNnzqRZs2YsWrSIhx56iLS0NPr378/WrVtZsGABv/3tb/nPf/7D1q1b6dixI6+99hpjxoxh2bJlHDt2jHvvvZeRI0eya9cubr75Zg4dOkR5eTnTpk2jX79+3HXXXeTm5iIi3HnnnYwePZqcnBwGDRrEjTfeyIcffsgjjzxCeXk5ffr0Ydq0aZxyyimkp6czfPhw3n33XcrKyvjHP/5Bt27dgn6tL7/8Mrm5ufzlL38hJyeHU089ldzcXL766iv+9Kc/ceONNwIwadIk5syZw7Fjxxg6dChPPvlkbL8nJnJFRb7B0aZNTC9vQ5soFBQU8Itf/ILPPvuMtLQ0xo8fz5IlS8jPzycrK4spU6ZQWlrKyJEjee+998jLy2Pv3r0+19i4cSNLlixh1qxZzJgxgxYtWrBq1SpWrVrFCy+8wLZt23jjjTfIzs5mzZo1rF27lt69e7NmzRqKi4vZsGED69ev54477vC5bmlpKTk5OcyePZv169dXBlCFtm3bkp+fz6hRo5g8eXJYX/euXbtYsWIFCxYsYMyYMQAsXryYgoICPv30U9asWUNeXh7Lly+P8E/WxNT//I9viJSUQNOmMf2IhtEjqaXn4KSzzz6bvn37AvDJJ5+wceNG+vfvD8B3333HxRdfzObNm+ncuXNl7cStt97K9OnTK68xePBgmjTxzKovXryYdevWMXfuXAAOHjxIQUEBffr04c4776SsrIwhQ4bQu3dvOnfuzNatW7n//vu57rrruPrqq33atmXLFjp16sS5554LwPDhw3nuued48MEHARg2bBgAF154IfPmzQvr6x4yZAgnnXQS5513Hrt3765s++LFi8nIyADg8OHDFBQUMGDAgLCubWKotBSaVLtjM3kyPPywIx/VMILEJWlpaZW/VlWuuuoqZs2a5XPO6tWrw7rGs88+S3Z2do3zli9fzsKFC8nJyeGhhx7i9ttvZ+3atbz//vv89a9/Zc6cObz00ks+16rNKaecAkCjRo0oLy+v9dxg763+OarK2LFjGTlyZFjXMg754AOo/o/Ll19Chw6OfZwNbWKkb9++fPzxx3zx
xRcAHD16lM8//5xu3bqxdetWCgsLAZg9e3bQa2RnZzNt2jTKysoA+Pzzzzly5Ajbt2/ntNNO4+c//zl33303+fn57Nu3jxMnTnDDDTcwfvx48vPzfa7VrVs3CgsLK9vzt7/9jcsuu8yBr7yq7S+99BKHDx8GoLi4mD179jj2eSYIVRCpCpFhwzzHHAwRsB5JzLRr146XX36ZW2+9lWPHjgEwfvx4zj33XJ5//nl++MMfkpaWRp8+fYJe4+6776awsJDMzExUlXbt2jF//nyWLVvGpEmTSElJoVmzZrz66qsUFxdzxx13cOLECQAmTJjgc63U1FRmzpzJTTfdVDnZes899zj29V999dVs2rSJiy++GIBmzZrx2muvcdpppzn2mcbPkiVw1VVVr1esAO9Qu4JTi/okEZ79m5WVpf4bG23atInu3bu71KLwHD58mGbNmqGq3HvvvXTp0oXRo0e73SxXJNL3LaH4Vw0fOVJjQtV/UR94CtZqqzURkTxVzarr421oUw9eeOEFevfuzfnnn8/BgwdtHsHEzu7dviGSmekZygS4K1Pbor5o2dCmHowePTppeyDGwT1C7rwTZs6ser1hA5x/ftDTnVzUZ0FijIMc2SPkxAlo5LumhhCmKM5q2YTiAKERi0V9NrQxxgHzVxfTf+JSHpy9JrbDiXnzfEPkhRdCChFwdlGf9UiMibFAk5r+IhpO+E+ofvcdhLHbXEUPyIlhlgWJMTEWaFLTX1jDibVrPWu/Klx7LSxcGFHb6lzUFyEb2tSDa6+9lgMHDtR6TjTL/6sv+a/N5ZdfXufzgaZOncrRo0cjaofxqKu3EdZwQsQ3RPLzIw4RJyVNj8SN3bUrthdYtGhRnec6sfw/ElOnTuW2226jaYwXdSWTYJOa4NkjJKSfvWPHIDXV91gc13wlRY+kYsxafKAEpWrmfP7q4qiuO2XKFHr06EGPHj2Y6l04WFhYSNeuXbn99tvp0aMHX375Jenp6ezbtw+A3//+93Tt2pVLLrmEW2+9tXLlbU5OTuVivfT0dMaNG0dmZiY9e/Zk8+bNAHz66af069ePjIwM+vXrx5YttU/YlZSUcMstt9C9e3eGDh1KSUnVD/eoUaPIysri/PPPZ9y4cQA888wz7Ny5k4EDBzJw4MCg55naBZvUnHpzbz4ec0XdIXLppb4h8uSTcR0ikCQ9Eid2187Ly2PmzJmsXLkSVeWiiy7isssuo1WrVhQUFPDKK69UrgyukJuby5tvvsmaNWsoLy8nMzOTCy+8MOD1K5b5P//880yePJkXX3yRbt26sXz5cho3bsySJUv49a9/zZtvvhm0jdOmTaNp06Zs2rSJdevWkZmZWfl7f/jDH2jdujXHjx/nyiuvZN26dTzwwANMmTKFjz76iLZt2wY9r1evXhH9mSWLqCY1/SdUy8tr3uqNQ0kRJE4U4qxYsYKhQ4dWrt4dNmwY//znPxk8eLDP9gL+77n++usrtw340Y9+FPT6gZb5Hzx4kOHDh1NQUICIVC7uC2b58uU88MADAPTq1csnAObMmcP06dMpLy9n165dbNy4MWBAhHqe8RX2pOa8eXDDDb7H4rwXUl1SBIkThTi1rVGqvjVAqO/xF2iZ/+OPP87AgQN56623KCws5PLLL6/zOoF2bt+2bRuTJ09m1apVtGrVipycnICPiQj1PBMlv+/R0r8v5vFtjdg5ZmG9Pi0vGo7OkYjIaBH5TEQ2iMgsEUkVkU4islJECkRktoic7GQbwJlCnAEDBjB//nyOHj3KkSNHeOutt7j00ktrfc8ll1zCu+++S2lpKYcPH2bBggVhfebBgwdp397zA/Xyyy+H1MbXX38dgA0bNrBu3ToADh06RFpaGi1atGD37t289957le9p3rw53377bZ3nmRj4+usaITI/v4h7NxyP+Xye0xwLEhFpDzwAZKlqD6ARcAvwFPBnVe0CfAPc5VQbKjixu3ZmZiY5OTl8//vf56KLLuLuu++u
3B0smD59+jB48GB69erFNddcQ8+ePWnRokXIn/nYY48xduxYMjIyQtqMaNSoURw+fJju3bvzxBNPVM7HXHDBBWRkZNCtWzd+8pOfVO7qBjBixAiuueYaBg4cWOt5JkpnnOG7b+ojj4CqowvrnOTYNgLeIPkEuAA4BMwHngVeB85Q1XIRuRj4rarW3BKsmkTfRqC6ii0Fjh49yoABA5g+fbrPJGhDl6jft5jyH26eOFF5rNOYhQT6GynAtonXOd60Gp/r9jYCqloMTAZ2ALuAg0AecEBVK/45LQICdgtEZISI5IpIrv+GyYlsxIgR9O7dm8zMTG644YakCpGkN21azRCp2NHMy82n5UXDsclWEWkFXA90Ag4A/wCuCXBqwC6Rqk4HpoOnR+JQM+vdG2+84XYTjBv8A2TrVgjwMPVHs7sG3HyoPp6WFw0n79r8ANimqnsBRGQe0A9oKSKNvb2SDsDOSD9AVe15sgkkEXbji7lt26BzZ99jtfw5OLmwzklOBskOoK+INAVKgCuBXOAj4Ebg78Bw4O1ILp6amsr+/ftp06aNhUkCUFX2799Pqn/Zd0Pm/3P55JPwxBN1vs2phXVOcixIVHWliMwF8oFyYDWeocpC4O8iMt57bEYk1+/QoQNFRUU1Hjhl4ldqaiodHN7NPC6ogv/jWxt4byxhN382JpZitqjzuuvAf5FmAvwdCybUuzZJUdlqTG1ith2i/1Amhg+lcmP1ejiSYvWvMbWJugjs448D39aNYYg4sXo9lixITNKLalGnCFxySdXrP/4x5kOZRKh2taGNSXoRLeosK4OT/ZaJOTQX4uRjJGLFeiQm6YW9qFOk3kIEEqPa1YLEJL2wFnX6z4V8/bXjd2WcfIxErNjQxhhCKAJ76SW4y2+hej3d1k2EalcLEmPq4t8LefZZuO++em1CvFe7WpAYE8yhQ+C/X0wCF5c5yYLEmEACrd+yEAnKJluN8ecfIqWlFiJ1sCAxpsJvfhO4QtW7Ebe/igeFdxqzkP4Tl8ZVpWl9s6GNMVAzQN5+GwYPDnp6zNbnNBDWIzHJrbg4cC+klhCBxChbr0/WIzHJyz9AzjwTdoa2YV8ilK3XJwsSk5CiXlbvHyLHj9fcjKgWTjx0LZHZ0MYknKiW1d98c+ChTBghAolRtl6fLEhMwol4fkIE5sypev3JJxHf1nXioWuJzIY2JuGEPT+xfj34P/g8BnUh8V62Xp+sR2ISTljL6kV8Q+Sqq6y4zAEWJCbhhDQ/4fcEu8pjixfXQwuTjwWJSTh1zk/06pV0j4Nwm82RmIQUdH7CvxfyxRdwzjn106gkZj0S0zB88EHgoYyFSL2wHomJe3UWn/kHyP33wzPP1G8jk5wFiYlrtS6O63k6pKT4vsHmQlxhQWLiWrDisyGZAR4+ZSHiGpsjMXEtUJFZ4VODfA/s328h4jJHg0REWorIXBHZLCKbRORiEWktIh+ISIH3/62cbIOJX6FsDFS9yOzG9UtqhogqtG7tdFNNHZzukTwN/K+qdgMuADYBY4APVbUL8KH3tUkyoS68qyg+K3xqEJMXTa08vv6hcdYLiSOiDn0zRORUYC3QWat9iIhsAS5X1V0iciawTFVrXTKZlZWlubm5jrTTuKP/xKUBl+G3b9mEj8dcUXXg8GFo3tznnPn5RbbGpZ6ISJ6qZtV1npOTrZ2BvcBMEbkAyAN+CZyuqrsAvGFyWqA3i8gIYARAx44dHWymcUNIC++C7OQ+xKE2mcg5ObRpDGQC01Q1AzhCGMMYVZ2uqlmqmtWuXTun2mhcUufCO/8QKSmxoUwcczJIioAiVV3pfT0XT7Ds9g5p8P5/j4NtMHEq2MK7GdsWBK5QTU2tx9aZcDkWJKr6FfCliFTMf1wJbATeAYZ7jw0H3naqDSZ+BVp4t2n8NXSb/ueqk+bOtV5IgnC6IO1+4HURORnYCtyBJ7zmiMhdwA7gJofbYOJU5cK7nTuhvd/kqQVIQnE0SFR1DRBo
xvdKJz/XJBD/YUy7drDHRruJxkrkjXui3MndxA/7rpmIRPW4ykceiclO7iZ+WI/EhC2qx1X6B8jq1dC7txPNNPWozn8CROQ+Ww9jqovocRAFBYF7IRYiDUIofckzgFUiMkdEfigSqNzQJJOwHwchAueeW/X6Zz+zuzINTJ1Boqq/AboAM4AcoEBE/igitoddkgr5cRDBdnJ/9VWHWmbcEtLslnfR3Vfe/8qBVsBcEfmTg20zcSqkx0EMGGA7uSeROidbReQBPBWo+4AXgUdVtUxETgIKgMecbaKJNxUTqkH3UfXvhWzfDrbwskEL5a5NW2CYqm6vflBVT4jIoCDvMQ1cwMdBfPQRXHGF7zHrhSSFUOZInvAPkWq/tyn2TTIJScQ3RO65x0IkiVgdiYlOebnt5G4sSEwUUlI8QVKdhUhSsppkExkR3xDZs8dCJIlZkJjwPPdc4NoQ28UuqdnQxoTOP0D++79h/Pgap9X5iE3T4FiQmLp9+y2ceqrvsSDDmKgW9JmEZUMbUzuRkEMEIlzQZxKeBYkJzn8oc+RInROqYS/oMw2CBYmpafTowBOqTZvW+daQF/SZBsWCxPgSgalVj8ZkxoywbuuGtKDPNDg22Wo8iouhQwffYxHUhdS5oM80SBYkJuijMSMVcEGfadBsaJPs/EOkvNwqVE3YrEcS5xwr7srOhsWLfY9ZgJgIWZDEMceKu/x7IR98AD/4QeTXM0nPgiSO1VbcFVGQrF1bc9f2WipUbcLUhMqCJI7FtLjLvxdy9tlQWBjwVCtzN+FyfLJVRBqJyGoRWeB93UlEVopIgYjM9j5g3AQQk+KuYDu5BwkRsDJ3E776uGvzS6D6loxPAX9W1S7AN8Bd9dCGhBR1cdfpp0e0k7uVuZtwORokItIBuA7P7vN4H651BTDXe8orwBAn25DIhmS0Z8KwnrRv2QQB2rdswoRhPUMbXoh4NhuqsH59yHdlrMzdhMvpOZKpeB5X0dz7ug1wQFUrttYqAgL+rRCREcAIgI5J/CiDsIu7/u//4PLLfY+FeVv30eyuPnMkYGXupnaO9Ui8j6rYo6p51Q8HODXgT7mqTlfVLFXName7b4VGxDdEbrop4jL3iHtCJik52SPpDwwWkWuBVOBUPD2UliLS2Nsr6QDsdLANyeH4cWjs962MsrjMytxNOBzrkajqWFXtoKrpwC3AUlX9KfARcKP3tOHA2061ISk0bRrzEDEmXG6stfkV8JCIfIFnzmSGC21oGESgpNqdlN27LUSMK+qlIE1VlwHLvL/eCny/Pj63wXrjDfjpT32PWYAYF1lla6LxLy6bOBF+9SufQ1bebuqbBUmiOHIEmjXzPRagF2Ll7cYNth9JIhAJKUTAytuNOyxI4l2YO7lbebtxgwVJvJo+PaKd3K283bjBgiQeicDIkVWvZ80K+a6M7eJu3GCTrfFk/35o29b3WJi3dW0Xd+MGC5J44T+MyciA/PyILmXl7aa+WZDEg0A7uTdqFPhcY+KQBUmMBSoGgyBDjccfh/HjfS8QwlDGCs5MvBFNgNLqrKwszc3NdbsZdfIvBgNIOUlAoOx41Z9zk5RGbBp/je+bV62CrKyIPqNJSiNb5m8cISJ5qlrnD6bdtYmhQMVgZSfUJ0Q6HPiqZoiohhQiwT7DCs6M22xoE0N1FX0VPjXI98Att3hu7cbgM6zgzLjJeiQxVFvRl3+I9J/wYdghUttnWMGZcZMFSQwFKgYbvnphjRDp/pv36iwQm7+6mP4Tl9JpzEL6T1zK/NXFQT/DCs6M22xoE0P+xWDb/AIk677XOKX9mUyo4y5LKCt47a6NiSd218YJ27ZB586+x8L4c+4/cSnFAeY82rdswsdjroi2dcaEzO7auKVLF98QmTMn7DJ3m1A1icaGNrGiGvSpduEWkJ3VsknAHolNqJp4ZT2SWHj6ad8Que02nxAZO289xQdKUKrmOyomTwOxCVWTaKxHEkBYPQj/dTKHD0NaWuXL
2grIgl3TJlRNorEg8RPynqebN0P37r5vDjAXEul8h63gNYnEhjZ+QipBT0nxDZF//zvohKoVkJlkYEHip9YeRFmZZyhTXl71G6rQt2/Q69l8h0kGFiR+gvUUfvfJ63DyyVUHHn00pNu69kBukwxsjsTPo9ldayzTr7HY7rvvPMObAIJN1FpwmIbMgsRP9Tsm7TauYf7fHvY9oZZeiD2cyiQrx4JERL4HvAqcAZwApqvq0yLSGpgNpAOFwI9V9Run2lFdqLd1h2S0Z8jQ/rB9e9XBzz6D886r9fqR3Oo1piFwco6kHHhYVbsDfYF7ReQ8YAzwoap2AT70vnZcyIVhx455JlSrh4hqnSECVtpukpdjQaKqu1Q13/vrb4FNQHvgeuAV72mvAEOcakN1Id3Wff55SE2tev3OO2Gtk7FbvSZZ1csciYikAxnASuB0Vd0FnrARkdOCvGcEMAKgY8eOUbehzt6Cf4XqiRM1j9Uh0ESt3eo1ycDx278i0gx4E3hQVQ+F+j5Vna6qWaqa1a5du6jbEaxXMKBkp29gjBrl6YWEGSJgt3pN8nK0RyIiKXhC5HVVnec9vFtEzvT2Rs4E9jjZhgqBegtLZvyC/9q3o+qkvXtrPukuTHar1yQjJ+/aCDAD2KSqU6r91jvAcGCi9/9vR/tZodyNqX5b98Cer/nszzdV/WZqKpTYhKgxkXJyaNMf+BlwhYis8f53LZ4AuUpECoCrvK8jFs4y/SEZ7fm48z7fEFm2zELEmCg51iNR1RVAsImGK2P1OWHVbgwaBAsXVr2OYELVGFNTwq+1Cal2Y8cOT2BUhMjSpRFPqBpjakr4IKmzduOpp+Dssz2/TkmB0lIYOLCeWmdMckj4IAm2TP9Xl3X09DjGeAtnp071LLY75RQXWmlMw5bwi/YCbUs4qdlO+vWr9nzdnTvhzDNdaqExDV/CBwlUq91QhQEDYMUKz2/8+Mcwe7a7jTMmCTSIIAHgm2+gdeuq1//+d607lxljYqfhBMm//uX5f7t2nqFM44bzpRkT7xrO37Zrr/XckbHJVGPqXcLftakkYiFijEsaTpAYY1xjQWKMiZoFiTEmahYkxpioWZAYY6ImGsbmxm4Rkb3A9jpPdFZbYJ/LbYgF+zriS7x/HWerap17nSZEkMQDEclV1Sy32xEt+zriS0P5OmxoY4yJmgWJMSZqFiShm+52A2LEvo740iC+DpsjMcZEzXokxpioWZAYY6JmQVILEfmeiHwkIptE5DMR+aXbbYqGiDQSkdUissDttkRKRFqKyFwR2ez9vlzsdpsiISKjvT9TG0Rkloik1v2u+GVBUrty4GFV7Q70Be4VkfNcblM0fglscrsRUXoa+F9V7QZcQAJ+PSLSHngAyFLVHkAj4BZ3WxUdC5JaqOouVc33/vpbPD+0CflgXxHpAFwHvOh2WyIlIqcCA/A8ChZV/U5VD7jbqog1BpqISGOgKbDT5fZExYIkRCKSDmQAK91tScSmAo8BJ9xuSBQ6A3uBmd4h2osikuZ2o8KlqsXAZGAHsAs4qKqL3W1VdCxIQiAizYA3gQdV9ZDb7QmXiAwC9qhqntttiVJjIBOYpqoZwBFgjLtNCp+ItAKuBzoBZwFpInKbu62KjgVJHUQkBU+IvK6q89xuT4T6A4NFpBD4O54Hu7/mbpMiUgQUqWpFr3AunmBJND8AtqnqXlUtA+YB/VxuU1QsSGohIoJnPL5JVae43Z5IqepYVe2gqul4JvWWqmrC/Quoql8BX4pIV++hK4GNLjYpUjuAviLS1PszdiUJOGlcXcPZRd4Z/YGfAetFZI332K9VdZGLbUp29wOvi8jJwFbgDpfbEzZVXSkic4F8PHcGV5PgpfJWIm+MiZoNbYwxUbMgMcZEzYLEGBM1CxJjTNQsSIwxUbMgMcZEzYLEGBM1CxLjCBHpIyLrRCRVRNK8e2/0cLtdxhlWkGYcIyLjgVSgCZ41MhNcbpJxiAWJcYy3jH0VUAr0
U9XjLjfJOMSGNsZJrYFmQHM8PRPTQFmPxDhGRN7Bs21BJ+BMVb3P5SYZh9jqX+MIEbkdKFfVN0SkEfAvEblCVZe63TYTe9YjMcZEzeZIjDFRsyAxxkTNgsQYEzULEmNM1CxIjDFRsyAxxkTNgsQYE7X/B0sz0u81NkrAAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x6cbaf370>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "x = dataset.Hours\n",
    "y = dataset.Scores\n",
    "y_new = b_0 + b_1 * x\n",
    "\n",
    "plt.figure(figsize=(4, 3))\n",
    "ax = plt.axes()\n",
    "ax.scatter(x, y, label = \"original data\")\n",
    "ax.plot(x, y_new, \"r-\",label=\"regression line\")\n",
    "\n",
    "ax.set_xlabel('x')\n",
    "ax.set_ylabel('y')\n",
    "\n",
    "plt.legend(loc = 'best')\n",
    "\n",
    "plt.show() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multiple Linear Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Having covered the case of a single regressor $X$, we now consider $p$ regressors $X_1, X_2, \\dots, X_p$, with still only one response $Y$. The model becomes:\n",
    "\n",
    "$$Y = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_p X_p + \\epsilon$$\n",
    "\n",
    "With several regressors we must also ask whether they are random; here we treat them as non-random (fixed), which keeps the analysis tractable. Collecting the regressors into a matrix $X$, the coefficients into a vector $\\beta$, the responses into a vector $Y$, and the errors into a vector $\\epsilon$, we can write the model in matrix form:\n",
    "\n",
    "$$Y = X\\beta + \\epsilon$$ \n",
    "\n",
    "As before, we can apply least squares to this model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$\\begin{align*}\n",
    "    Q(\\beta) &= (Y - X\\beta)^T(Y - X\\beta) \\\\\n",
    "    \\frac{\\partial Q}{\\partial \\beta} &= -2X^T Y + 2X^T X\\beta = 0 \\\\\n",
    "\\end{align*}$\n",
    "\n",
    "Solving the normal equations $X^T X\\beta = X^T Y$ (assuming $X^T X$ is invertible) gives $\\hat{\\beta} = (X^T X)^{-1}X^T Y$. "
   ]
  },
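As a sanity check, the closed-form solution $\hat{\beta} = (X^T X)^{-1}X^T Y$ can be verified numerically. This is a minimal sketch on invented synthetic data (`X_demo`, `beta_true`, and `y_demo` are illustrative names, not part of the dataset used below):

```python
import numpy as np

rng = np.random.RandomState(0)
X_demo = np.column_stack([np.ones(50), rng.randn(50, 2)])  # intercept column + 2 regressors
beta_true = np.array([1.0, 2.0, -3.0])
y_demo = X_demo @ beta_true + 0.1 * rng.randn(50)          # small noise term epsilon

# beta_hat = (X^T X)^{-1} X^T y, computed via the normal equations
beta_hat = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
print(beta_hat)  # close to beta_true
```

With little noise, the recovered coefficients should sit very close to the true ones.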
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us work through an example to illustrate the code.\n",
    "\n",
    "The dataset below records, for 48 US states, petrol consumption (in millions of gallons) together with the petrol tax (in cents), per-capita income (in dollars), miles of paved highways, and the proportion of the population holding a driver's license. We will use multiple linear regression to predict consumption from the other variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Index</th>\n",
       "      <th>One</th>\n",
       "      <th>Petrol tax (cents per gallon)</th>\n",
       "      <th>Average income (dollars)</th>\n",
       "      <th>Paved Highways (miles)</th>\n",
       "      <th>Proportion of population with driver's licenses</th>\n",
       "      <th>Consumption of petrol (millions of gallons)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>9.0</td>\n",
       "      <td>3571</td>\n",
       "      <td>1976</td>\n",
       "      <td>0.525</td>\n",
       "      <td>541</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>9.0</td>\n",
       "      <td>4092</td>\n",
       "      <td>1250</td>\n",
       "      <td>0.572</td>\n",
       "      <td>524</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>9.0</td>\n",
       "      <td>3865</td>\n",
       "      <td>1586</td>\n",
       "      <td>0.580</td>\n",
       "      <td>561</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>7.5</td>\n",
       "      <td>4870</td>\n",
       "      <td>2351</td>\n",
       "      <td>0.529</td>\n",
       "      <td>414</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>8.0</td>\n",
       "      <td>4399</td>\n",
       "      <td>431</td>\n",
       "      <td>0.544</td>\n",
       "      <td>410</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Index  One  Petrol tax (cents per gallon)  Average income (dollars)  \\\n",
       "0      1    1                            9.0                      3571   \n",
       "1      2    1                            9.0                      4092   \n",
       "2      3    1                            9.0                      3865   \n",
       "3      4    1                            7.5                      4870   \n",
       "4      5    1                            8.0                      4399   \n",
       "\n",
       "   Paved Highways (miles)  Proportion of population with driver's licenses  \\\n",
       "0                    1976                                            0.525   \n",
       "1                    1250                                            0.572   \n",
       "2                    1586                                            0.580   \n",
       "3                    2351                                            0.529   \n",
       "4                     431                                            0.544   \n",
       "\n",
       "   Consumption of petrol (millions of gallons)  \n",
       "0                                          541  \n",
       "1                                          524  \n",
       "2                                          561  \n",
       "3                                          414  \n",
       "4                                          410  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd \n",
    "import numpy as np \n",
    "import matplotlib.pyplot as plt \n",
    "\n",
    "dataset = pd.read_csv(\"https://raw.githubusercontent.com/TerenceLiu98/Using-R-Series/master/data_set/price.csv\")\n",
    "dataset.shape # the dataset's shape\n",
    "dataset.head() # the first five rows of the dataset "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Index</th>\n",
       "      <th>One</th>\n",
       "      <th>Petrol tax (cents per gallon)</th>\n",
       "      <th>Average income (dollars)</th>\n",
       "      <th>Paved Highways (miles)</th>\n",
       "      <th>Proportion of population with driver's licenses</th>\n",
       "      <th>Consumption of petrol (millions of gallons)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>48.00</td>\n",
       "      <td>48.0</td>\n",
       "      <td>48.000000</td>\n",
       "      <td>48.000000</td>\n",
       "      <td>48.000000</td>\n",
       "      <td>48.000000</td>\n",
       "      <td>48.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>24.50</td>\n",
       "      <td>1.0</td>\n",
       "      <td>7.668333</td>\n",
       "      <td>4241.833333</td>\n",
       "      <td>5565.416667</td>\n",
       "      <td>0.570333</td>\n",
       "      <td>576.770833</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>14.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.950770</td>\n",
       "      <td>573.623768</td>\n",
       "      <td>3491.507166</td>\n",
       "      <td>0.055470</td>\n",
       "      <td>111.885816</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.00</td>\n",
       "      <td>1.0</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>3063.000000</td>\n",
       "      <td>431.000000</td>\n",
       "      <td>0.451000</td>\n",
       "      <td>344.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>12.75</td>\n",
       "      <td>1.0</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>3739.000000</td>\n",
       "      <td>3110.250000</td>\n",
       "      <td>0.529750</td>\n",
       "      <td>509.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>24.50</td>\n",
       "      <td>1.0</td>\n",
       "      <td>7.500000</td>\n",
       "      <td>4298.000000</td>\n",
       "      <td>4735.500000</td>\n",
       "      <td>0.564500</td>\n",
       "      <td>568.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>36.25</td>\n",
       "      <td>1.0</td>\n",
       "      <td>8.125000</td>\n",
       "      <td>4578.750000</td>\n",
       "      <td>7156.000000</td>\n",
       "      <td>0.595250</td>\n",
       "      <td>632.750000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>48.00</td>\n",
       "      <td>1.0</td>\n",
       "      <td>10.000000</td>\n",
       "      <td>5342.000000</td>\n",
       "      <td>17782.000000</td>\n",
       "      <td>0.724000</td>\n",
       "      <td>968.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Index   One  Petrol tax (cents per gallon)  Average income (dollars)  \\\n",
       "count  48.00  48.0                      48.000000                 48.000000   \n",
       "mean   24.50   1.0                       7.668333               4241.833333   \n",
       "std    14.00   0.0                       0.950770                573.623768   \n",
       "min     1.00   1.0                       5.000000               3063.000000   \n",
       "25%    12.75   1.0                       7.000000               3739.000000   \n",
       "50%    24.50   1.0                       7.500000               4298.000000   \n",
       "75%    36.25   1.0                       8.125000               4578.750000   \n",
       "max    48.00   1.0                      10.000000               5342.000000   \n",
       "\n",
       "       Paved Highways (miles)  \\\n",
       "count               48.000000   \n",
       "mean              5565.416667   \n",
       "std               3491.507166   \n",
       "min                431.000000   \n",
       "25%               3110.250000   \n",
       "50%               4735.500000   \n",
       "75%               7156.000000   \n",
       "max              17782.000000   \n",
       "\n",
       "       Proportion of population with driver's licenses  \\\n",
       "count                                        48.000000   \n",
       "mean                                          0.570333   \n",
       "std                                           0.055470   \n",
       "min                                           0.451000   \n",
       "25%                                           0.529750   \n",
       "50%                                           0.564500   \n",
       "75%                                           0.595250   \n",
       "max                                           0.724000   \n",
       "\n",
       "       Consumption of petrol (millions of gallons)  \n",
       "count                                    48.000000  \n",
       "mean                                    576.770833  \n",
       "std                                     111.885816  \n",
       "min                                     344.000000  \n",
       "25%                                     509.500000  \n",
       "50%                                     568.500000  \n",
       "75%                                     632.750000  \n",
       "max                                     968.000000  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.describe() # to see some statistical details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Index', 'One', 'Petrol tax (cents per gallon)',\n",
       "       'Average income (dollars)', 'Paved Highways (miles)',\n",
       "       'Proportion of population with driver's licenses',\n",
       "       'Consumption of petrol (millions of gallons)'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.columns # get the column names of the dataset; this makes the next step easier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = dataset[['Petrol tax (cents per gallon)',\n",
    "       'Average income (dollars)', 'Paved Highways (miles)',\n",
    "       \"Proportion of population with driver's licenses\"]]\n",
    "y = dataset['Consumption of petrol (millions of gallons)']  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To train and then evaluate the model, we first split the data into a training set and a test set using `train_test_split`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split  \n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As in simple linear regression, in the multiple case the regression model must find the optimal coefficients for all the parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Coefficient</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Petrol tax (cents per gallon)</th>\n",
       "      <td>-40.016660</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Average income (dollars)</th>\n",
       "      <td>-0.065413</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Paved Highways (miles)</th>\n",
       "      <td>-0.004741</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Proportion of population with driver's licenses</th>\n",
       "      <td>1341.862121</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                 Coefficient\n",
       "Petrol tax (cents per gallon)                     -40.016660\n",
       "Average income (dollars)                           -0.065413\n",
       "Paved Highways (miles)                             -0.004741\n",
       "Proportion of population with driver's licenses  1341.862121"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.linear_model import LinearRegression  \n",
    "regressor = LinearRegression()  \n",
    "regressor.fit(X_train, y_train) \n",
    "\n",
    "coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])  \n",
    "coeff_df  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we use the fitted model to make predictions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = regressor.predict(X_test)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Actual</th>\n",
       "      <th>Predicted</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>534</td>\n",
       "      <td>469.391989</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>410</td>\n",
       "      <td>545.645464</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>577</td>\n",
       "      <td>589.668394</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>571</td>\n",
       "      <td>569.730413</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>577</td>\n",
       "      <td>649.774809</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>704</td>\n",
       "      <td>646.631164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>487</td>\n",
       "      <td>511.608148</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>587</td>\n",
       "      <td>672.475177</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>467</td>\n",
       "      <td>502.074782</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>580</td>\n",
       "      <td>501.270734</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Actual   Predicted\n",
       "29     534  469.391989\n",
       "4      410  545.645464\n",
       "26     577  589.668394\n",
       "30     571  569.730413\n",
       "32     577  649.774809\n",
       "37     704  646.631164\n",
       "34     487  511.608148\n",
       "40     587  672.475177\n",
       "7      467  502.074782\n",
       "10     580  501.270734"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  \n",
    "df  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean Absolute Error: 56.82224747896468\n",
      "Mean Squared Error: 4666.344787588359\n",
      "Root Mean Squared Error: 68.31064915215167\n"
     ]
    }
   ],
   "source": [
    "from sklearn import metrics  \n",
    "print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  \n",
    "print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  \n",
    "print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Logistic Regression\n",
    "\n",
    "So far, we have covered **Simple Linear Regression** and **Multiple Regression**:\n",
    "\n",
    "* Simple linear regression\n",
    "    * Relationship between a numerical response and a single numerical or categorical predictor\n",
    "* Multiple regression \n",
    "    * Relationship between a numerical response and multiple numerical and/or categorical predictors.\n",
    "\n",
    "But what if the predictors are awkward: non-linear, or strongly correlated with one another (high collinearity)? And what if the response itself is unusual, for example a categorical variable?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reviewing what we have learned\n",
    "\n",
    "In a statistics course, what did we learn about regression?\n",
    "* Model parameter interpretation & estimation \n",
    "* Hypothesis tests for slope and intercept parameters \n",
    "* Hypothesis tests for all regression parameters \n",
    "* Confidence intervals for regression parameters \n",
    "* Confidence and prediction intervals for predicted means and values (SLR only)\n",
    "* Model diagnostics, residual plots, outliers\n",
    "* $R^2$, Adjusted $R^2$ \n",
    "* Model selection (MLR only)\n",
    "* Simple transformations \n",
    "\n",
    "### Odds\n",
    "\n",
    "Odds are another way of quantifying how likely an event is. The concept is common in gambling, and it also appears in logistic regression.\n",
    "\n",
    "For an event $E$, the odds are defined as\n",
    "\n",
    "$$ odds(E) = \\frac{P(E)}{P(E^C)} = \\frac{P(E)}{1 - P(E)} $$\n",
    "\n",
    "If the event occurs $x$ times and fails to occur $y$ times, then\n",
    "\n",
    "$$ odds(E) = \\frac{x}{y} = \\frac{x / (x + y)}{y / (x + y)}$$\n",
    "\n",
    "which corresponds to\n",
    "\n",
    "$$P(E) = \\frac{x}{x + y}, \\ \\ P(E^C) = \\frac{y}{x + y}$$\n",
    "\n",
    "Note that $odds(p)$ is an increasing function of $p$. Taking the logarithm of the odds, $\\log\\frac{p}{1 - p}$, keeps it increasing and turns the ratio into a difference: $logit(p) = \\log p - \\log(1 - p)$."
   ]
  },
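A quick numeric illustration of odds and log-odds. This is a minimal sketch; `odds` and `logit` are our own helper names:

```python
import math

def odds(p):
    # odds(E) = P(E) / (1 - P(E))
    return p / (1 - p)

def logit(p):
    # log-odds: log p - log(1 - p)
    return math.log(p) - math.log(1 - p)

# e.g. 3 successes out of 4 trials: P(E) = 0.75, so the odds are 3 to 1
print(odds(0.75))  # 3.0
print(logit(0.5))  # 0.0, i.e. even odds
```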
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a very general way of thinking about regression, known as the framework of **Generalized Linear Models (GLMs)**, and logistic regression is one instance of it.\n",
    "\n",
    "GLMs unify the linear regression model with non-linear regression models that admit non-normal response distributions.\n",
    "\n",
    "Logistic regression covers the case where the response has only two possible outcomes, conventionally called \"success\" and \"failure\" and coded as $1$ and $0$.\n",
    "\n",
    "### The logistic regression model \n",
    "\n",
    "#### Binomial Dependent Variable's Model\n",
    "\n",
    "**The response variable takes only the values $0$ and $1$.** \n",
    "\n",
    "Suppose the regression model has the form\n",
    "\n",
    "$$y_i = x_i'\\beta + \\epsilon_i$$\n",
    "\n",
    "where $x_i' = [1, x_{i1}, x_{i2}, \\dots, x_{ik}], \\ \\ \\ \\beta' = [\\beta_0, \\beta_1, \\dots, \\beta_k]$, and $y_i$ is either $0$ or $1$: \n",
    "\n",
    "$$ y_i = \\left\\{\n",
    "\\begin{aligned}\n",
    "&1 \\ \\ \\text{with} \\ P(y_i = 1) = \\pi_i \\\\\n",
    "&0 \\ \\ \\text{with} \\ P(y_i = 0) = 1 - \\pi_i\n",
    "\\end{aligned}\n",
    "\\right.\n",
    "$$\n",
    "\n",
    "Since $E(\\epsilon_i) = 0$, the expected response is $E(y_i) = 1 \\cdot \\pi_i + 0 \\cdot (1 - \\pi_i) = \\pi_i$, so $E(y_i) = x_i'\\beta = \\pi_i$. In other words, the mean of the response is given by the linear predictor, and that mean is exactly the probability that the response equals $1$. It also follows that the error term can take only two values:\n",
    "\n",
    "$$ \\epsilon_i = \\left\\{\n",
    "\\begin{aligned}\n",
    "&1 - x_i'\\beta  \\ \\ \\text{when} \\ y_i = 1 \\\\\n",
    "&- x_i'\\beta \\ \\ \\text{when} \\ y_i = 0 \n",
    "\\end{aligned}\n",
    "\\right.\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consequently, the errors in this model cannot be normal. Moreover, the error variance is not constant: $\\sigma^2_{y_i} = E\\{y_i - E(y_i)\\}^2 = (1 - \\pi_i)^2 \\pi_i + (0 - \\pi_i)^2(1 - \\pi_i) = \\pi_i(1 - \\pi_i)$. Since $E(y_i) = x_i'\\beta = \\pi_i$, this expression can be rewritten as \n",
    "\n",
    "$$\\sigma^2_{y_i} = E(y_i)[1 - E(y_i)]$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This shows that the variance of an observation is a function of its mean, and since $0 \\leq E(y_i) = \\pi_i < 1$, the mean function is subject to a constraint."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We therefore model $\\log\\frac{p(x)}{1 - p(x)} = x'\\beta$; this is the logistic model. Solving for the probability gives $p(x) = \\mathbf{\\frac{1}{1 + \\exp[-x'\\beta]}}$."
   ]
  },
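In code, the logistic (sigmoid) function that maps the linear predictor $x'\beta$ to a probability looks like this — a minimal sketch, with `sigmoid` as our own helper name:

```python
import numpy as np

def sigmoid(z):
    # p = 1 / (1 + exp(-z)); maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                    # 0.5
print(sigmoid(np.array([-5.0, 5.0])))  # close to 0 and close to 1
```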
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Likelihood Function for Logistic Regression\n",
    "\n",
    "Because logistic regression predicts probabilities, rather than just classes, we can fit it by maximum likelihood. For each training data point, we have a vector of features, $x_i$, and an observed class, $y_i$. The probability of that class was either $p(x_i)$, if $y_i = 1$, or $1 - p(x_i)$, if $y_i = 0$. The likelihood is then \n",
    "\n",
    "$$\\displaystyle L(\\beta_0, \\beta) = \\prod_{i=1}^n p(x_i)^{y_i}(1 - p(x_i))^{1 - y_i}$$ \n",
    "\n",
    "The log-likelihood turns the product into sums:\n",
    "\n",
    "$\\begin{align*}\n",
    "\\ell(\\beta_0, \\beta) &= \\sum_{i=1}^n y_i \\log p(x_i) + (1 - y_i) \\log (1 - p(x_i)) \\\\ \n",
    "                  &= \\sum_{i=1}^n \\log (1 - p(x_i)) + \\sum_{i = 1}^n y_i \\log \\frac{p(x_i)}{1 - p(x_i)} \\\\\n",
    "                  &= -\\sum_{i=1}^n \\log (1 + \\exp[\\beta_0 + x_i \\beta]) + \\sum_{i = 1}^n y_i (\\beta_0 + x_i \\beta) \\\\\n",
    "\\end{align*}$\n",
    "\n",
    "Typically, to find the **maximum likelihood estimates we differentiate the log-likelihood with respect to the parameters**, set the derivatives equal to zero, and solve. To start, take the derivative with respect to one component of $\\beta$, say $\\beta_j$: \n",
    "\n",
    "$\\begin{align*}\n",
    "    \\frac{\\partial \\ell}{\\partial \\beta_j} &= -\\sum_{i = 1}^n \\frac{e^{\\beta_0 + x_i \\beta}}{1 + e^{\\beta_0 + x_i \\beta}}x_{ij} + \\sum_{i = 1}^n y_i x_{ij} \\\\\n",
    "    &= \\sum_{i = 1}^n (y_i - p(x_i; \\beta_0, \\beta))x_{ij} \\\\\n",
    "\\end{align*}$\n",
    "\n",
    "We cannot set this to zero and solve exactly: this is a transcendental equation with no closed-form solution. We can, however, solve it approximately with numerical methods such as Newton's method or gradient ascent. \n"
   ]
  },
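The score equation above can be solved iteratively. Below is a minimal gradient-ascent sketch on invented toy data; `fit_logistic`, the learning rate, and the iteration count are our own illustrative choices, not a production algorithm:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Maximize the logistic log-likelihood by gradient ascent.

    X is assumed to already contain a leading column of ones,
    so beta[0] plays the role of beta_0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted probabilities p(x_i)
        beta += lr * X.T @ (y - p) / len(y)   # averaged gradient sum_i (y_i - p_i) x_i
    return beta

# toy data: the class depends on a single feature through the logistic model
rng = np.random.RandomState(0)
x1 = rng.randn(200)
X_toy = np.column_stack([np.ones(200), x1])
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x1)))
y_toy = (rng.rand(200) < p_true).astype(float)

beta_hat = fit_logistic(X_toy, y_toy)
print(beta_hat)  # roughly recovers intercept 0.5 and slope 2.0
```

Because the log-likelihood is concave, this simple iteration converges to the unique maximum; Newton's method (as used by most libraries) does the same faster.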
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Code "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/python3.5/dist-packages/sklearn/linear_model/logistic.py:758: ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations.\n",
      "  \"of iterations.\", ConvergenceWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.9733333333333334"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "X, y = load_iris(return_X_y=True)\n",
    "clf = LogisticRegression(random_state=0, solver='lbfgs',\n",
    "                         multi_class='multinomial').fit(X, y)\n",
    "\n",
    "clf.predict(X[:2, :])        # predicted classes for the first two samples\n",
    "clf.predict_proba(X[:2, :])  # class probabilities for the first two samples\n",
    "\n",
    "clf.score(X, y)              # mean accuracy on the training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
